ML-KNN: A Lazy Learning Approach to Multi-Label Learning
Min-Ling Zhang, Zhi-Hua Zhou*
Abstract:
Multi-label learning originated from the investigation of the text categorization problem, where each document may belong to several predefined topics simultaneously. In multi-label learning, the training set is composed of instances each associated with a set of labels, and the task is to predict the label sets of unseen instances. In this paper, a multi-label lazy learning approach named ML-kNN is presented, which is derived from the traditional k-nearest neighbor algorithm. In detail, for each unseen instance, its k nearest neighbors in the training set are first identified. After that, based on statistical information gained from the label sets of these neighboring instances, i.e. the number of neighboring instances belonging to each possible class, the maximum a posteriori principle is utilized to determine the label set for the unseen instance. Experiments on three different real-world multi-label learning problems, i.e. Yeast gene functional analysis, natural scene classification and automatic web page categorization, show that ML-kNN achieves superior performance to some well-established multi-label learning algorithms.
Zhi-Hua Zhou's work
ML-kNN is the work of Zhi-Hua Zhou and Min-Ling Zhang; it builds probability tables using the maximum a posteriori principle.
For each label l, ML-kNN compares the event H_1^l (label l is relevant for the instance) with H_0^l (it is not), given the event E_j^l that exactly j of the k nearest neighbours carry label l. By Bayes' rule, P(H_b^l | E_j^l) is proportional to P(H_b^l) * P(E_j^l | H_b^l), so we finally obtain the decision rule

    y_t(l) = argmax_{b in {0,1}} P(H_b^l) * P(E_{C_t(l)}^l | H_b^l)

where C_t(l) is the number of neighbours of the test instance t that carry label l. With this derivation in hand, we can use the method to classify multi-label problems.
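The table-building step described above can be sketched in code. This is a minimal illustrative implementation, not the authors' original code: the function name `mlknn_tables`, the brute-force neighbour search, and the default `k=3` are my own choices, while `s` plays the role of the Laplace smoothing constant used in the paper.

```python
import numpy as np

def mlknn_tables(X, Y, k=3, s=1.0):
    """Estimate ML-kNN's probability tables from training data.

    X: (n, d) feature matrix; Y: (n, q) binary label matrix;
    s: Laplace smoothing constant.
    Returns the priors P(H_1^l) and the likelihood tables
    P(C = c | H_1^l) and P(C = c | H_0^l) for c = 0..k.
    """
    n, q = Y.shape
    # Prior probability that each label is relevant, with smoothing.
    prior1 = (s + Y.sum(axis=0)) / (2 * s + n)

    # Brute-force k nearest neighbours of every training point,
    # excluding the point itself.
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d2, np.inf)
    nn = np.argsort(d2, axis=1)[:, :k]
    counts = Y[nn].sum(axis=1)          # (n, q): neighbour label counts

    # c1[c, l]: #instances carrying label l whose neighbours carry it
    # exactly c times; c0[c, l]: the same for instances without label l.
    c1 = np.zeros((k + 1, q))
    c0 = np.zeros((k + 1, q))
    for i in range(n):
        for l in range(q):
            if Y[i, l]:
                c1[counts[i, l], l] += 1
            else:
                c0[counts[i, l], l] += 1
    like1 = (s + c1) / (s * (k + 1) + c1.sum(axis=0))
    like0 = (s + c0) / (s * (k + 1) + c0.sum(axis=0))
    return prior1, like1, like0
```

At prediction time, the same neighbour counts are computed for the unseen instance and plugged into the MAP rule above.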
Below is a PPT I made about ML-kNN. I think it is fairly detailed, but there is one point I still do not understand: why derive the similarity of attributes from the similarity of the label sets, and then apply that in reverse to the test data? I still do not see the intuitive meaning of this.
My commentary on the paper's abstract follows.
Commentary:
As with BoosTexter, which I introduced earlier, multi-label classification originated in document categorization, where each article may belong to several topics. In a multi-label problem, each training example is associated with a set of labels, and the goal is to predict those label sets. ML-kNN is an adaptation of the kNN method to multi-label problems. First, for a test example whose label set is unknown, find its k nearest neighbours in the training set. Then collect statistics from the label sets of those neighbours (the abstract's phrase "statistical information gained" confused me at first, but it simply means counts, not information gain): for each label, the number of the k neighbours that carry it. Quite a mouthful. Finally, maximum a posteriori estimation is used to decide the test example's label set. The experiments use three data sets of different kinds, all real-world multi-label classification problems, and the authors rather cheekily claim that the method's performance, or accuracy measures, beats some existing approaches on all three problems (note: "some", not all of them).
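The prediction steps just described (find the k neighbours, count labels, apply MAP) can be sketched as follows. The training points and the probability tables here are made-up illustrative numbers, not values from the paper:

```python
import numpy as np

# Toy setup: six 1-D training points with two labels, k = 3.
X = np.array([[0.0], [0.1], [0.2], [1.0], [1.1], [1.2]])
Y = np.array([[1, 0], [1, 0], [1, 1], [0, 1], [0, 1], [0, 1]])
k = 3

# Step 1: find the k nearest neighbours of the unseen instance.
x = np.array([0.04])
nn = np.argsort(((X - x) ** 2).sum(axis=1))[:k]

# Step 2: count, for each label, how many neighbours carry it.
c = Y[nn].sum(axis=0)                  # here: [3, 1]

# Step 3: MAP decision per label. Illustrative tables; the row
# index of the likelihood tables is the neighbour count c = 0..k.
prior1 = np.array([0.5, 0.5])          # P(label present)
like1 = np.array([[0.1, 0.1],          # P(count | label present)
                  [0.2, 0.2],
                  [0.3, 0.3],
                  [0.4, 0.4]])
like0 = like1[::-1]                    # P(count | label absent)
labels = np.arange(Y.shape[1])
pred = (prior1 * like1[c, labels] >
        (1 - prior1) * like0[c, labels]).astype(int)
```

Here all three neighbours of x carry label 0 and only one carries label 1, so the MAP comparison keeps label 0 and drops label 1.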